test: add tests for build pipeline scripts#28
Conversation
…idate CLI) Closes #23. Adds unit tests for all pipeline functions in scripts/build_v5_snapshot.py and CLI integration tests for scripts/validate_lead_scoring_dataset.py. Coverage: subsample edge cases (insufficient positives/negatives), inject_missingness rate bounds and source-conditional variation, derive_binary_features, cap_expected_acv, rename_and_select, boost_leakage_trap, and 6 CLI entrypoint tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
Adds automated coverage for the v5 build/validation scripts by introducing unit tests for the build pipeline functions and subprocess-based CLI integration tests for the validator entrypoint.
Changes:
- Add unit tests for
scripts/build_v5_snapshot.pypipeline steps (feature derivation, ACV capping, renaming/selection, subsampling, missingness injection, leakage trap boost). - Add CLI integration tests for
scripts/validate_lead_scoring_dataset.pycovering key flags and exit codes. - Update
.agent-plan.mdto record completion of the test work.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/scripts/test_build_v5_snapshot.py | Unit tests for v5 snapshot pipeline functions (determinism, immutability, edge cases). |
| tests/scripts/test_validate_cli.py | Subprocess CLI integration tests for validator entrypoint and flags. |
| tests/scripts/init.py | Marks tests/scripts as a package (test organization). |
| .agent-plan.md | Notes the addition of build/validation script tests in the plan. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def test_valid_csv_exit_code_zero(self, tmp_path): | ||
| csv_path = _make_valid_csv(tmp_path) | ||
| result = subprocess.run( # noqa: S603 | ||
| [sys.executable, str(_SCRIPT_PATH), "--csv", str(csv_path)], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=120, | ||
| ) | ||
| assert result.returncode == 0, f"stdout: {result.stdout}\nstderr: {result.stderr}" | ||
|
|
||
| def test_invalid_csv_exit_code_one(self, tmp_path): | ||
| csv_path = _make_invalid_csv(tmp_path) | ||
| result = subprocess.run( # noqa: S603 | ||
| [sys.executable, str(_SCRIPT_PATH), "--csv", str(csv_path)], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=120, | ||
| ) | ||
| assert result.returncode == 1 | ||
|
|
||
| def test_out_json_flag(self, tmp_path): | ||
| csv_path = _make_valid_csv(tmp_path) | ||
| json_path = tmp_path / "report.json" | ||
| subprocess.run( # noqa: S603 | ||
| [ | ||
| sys.executable, | ||
| str(_SCRIPT_PATH), | ||
| "--csv", | ||
| str(csv_path), | ||
| "--out-json", | ||
| str(json_path), | ||
| ], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=120, | ||
| ) | ||
| # JSON report should be written regardless of pass/fail | ||
| assert json_path.exists() | ||
| report = json.loads(json_path.read_text()) | ||
| assert "passed" in report | ||
| assert "checks" in report | ||
|
|
||
| def test_emit_release_snippet_flag(self, tmp_path): | ||
| csv_path = _make_valid_csv(tmp_path) | ||
| result = subprocess.run( # noqa: S603 | ||
| [ | ||
| sys.executable, | ||
| str(_SCRIPT_PATH), | ||
| "--csv", | ||
| str(csv_path), | ||
| "--emit-release-snippet", | ||
| ], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=120, | ||
| ) | ||
| # Snippet should be emitted regardless of pass/fail | ||
| assert "RELEASE SNIPPET" in result.stdout | ||
|
|
There was a problem hiding this comment.
These CLI tests invoke the full validator in multiple subprocesses. Since validate_dataset() runs a scikit-learn baseline plus multi-seed leakage-trap evaluation (default 10 seeds), each invocation does a non-trivial amount of work. Consider consolidating some flag assertions into fewer subprocess runs (e.g., combine --out-json and --emit-release-snippet with the happy-path run) to keep the suite fast and reduce CI time.
| # All available positives should be included | ||
| assert result["converted"].sum() <= 10 |
There was a problem hiding this comment.
In test_insufficient_positives, the assertion result["converted"].sum() <= 10 doesn’t actually verify the intended behavior (“all available positives should be included”). A bug that drops positives (e.g., returning 0 positives) would still satisfy <= 10. Consider asserting that the number of positives in the result equals the number of positives available in the input (or at least equals min(n_pos_needed, positives_available)).
| # All available positives should be included | |
| assert result["converted"].sum() <= 10 | |
| positives_available = int(df["converted"].sum()) | |
| n_pos_needed = int(100 * 0.50) | |
| # All available positives should be included | |
| assert result["converted"].sum() == min(n_pos_needed, positives_available) |
| subsample(df, rng, n=100, target_rate=0.10) # need 90 negatives | ||
| captured = capsys.readouterr() | ||
| assert "WARNING" in captured.err |
There was a problem hiding this comment.
test_insufficient_negatives only checks that a warning is emitted, but doesn’t assert anything about the returned sample composition/size after the adjustment. To make this regression-proof, consider asserting the output length and that the number of negatives equals the available negatives (or the adjusted target) when negatives are insufficient.
| subsample(df, rng, n=100, target_rate=0.10) # need 90 negatives | |
| captured = capsys.readouterr() | |
| assert "WARNING" in captured.err | |
| result = subsample(df, rng, n=100, target_rate=0.10) # need 90 negatives | |
| captured = capsys.readouterr() | |
| assert "WARNING" in captured.err | |
| # With 200 rows at 95% conversion, only 10 negatives are available. | |
| # Requesting 100 rows at 10% conversion would require 90 negatives, | |
| # so the sample must be adjusted to include all 10 negatives and the | |
| # corresponding 10 positives, for a total of 20 rows. | |
| assert len(result) == 20 | |
| negative_count = (result["converted"] == 0).sum() | |
| positive_count = (result["converted"] == 1).sum() | |
| assert negative_count == 10 | |
| assert positive_count == 10 |
| def test_enforce_1000_flag_fails_on_small(self, tmp_path): | ||
| csv_path = _make_valid_csv(tmp_path, n=200) | ||
| result = subprocess.run( # noqa: S603 | ||
| [ | ||
| sys.executable, | ||
| str(_SCRIPT_PATH), | ||
| "--csv", | ||
| str(csv_path), | ||
| "--enforce-1000", | ||
| ], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=120, | ||
| ) | ||
| assert result.returncode == 1 |
There was a problem hiding this comment.
test_enforce_1000_flag_fails_on_small only asserts a non-zero exit code. If the 200-row fixture ever fails another validation check (e.g., baseline AUC), this test would still pass even if --enforce-1000 stopped affecting row-count enforcement. Consider asserting the failure reason via stdout (e.g., that the summary contains a failing row_count check / expected-row message), or running once without --enforce-1000 to prove it passes without enforcement.
…oundary tests - Extract shared `make_v5_dataset()` builder into `tests/conftest.py`, eliminating duplicate synthetic data generators across 3 test files - Delete 4 tautological missingness tests (trivially true at n=1000) and the weak `test_rows_come_from_input` (O(n^2), tests pandas internals) - Strengthen `test_insufficient_negatives` to assert output composition - Add `@pytest.mark.parametrize` for subsample target rate/seed combos and missingness rate bounds across multiple seeds - Add boundary tests: `subsample` with n > input, `boost_leakage_trap` with zero converted leads, `inject_missingness` with small n and unknown lead sources, `rename_and_select` with extra columns - Use session-scoped fixtures for CLI tests (avoid regenerating CSV 4x) - Add precondition assertion on fixture conversion rate so sigmoid drift produces a clear error message instead of mysterious CLI failure Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
pr-agent-context report: This run includes unresolved review comments on PR #28 in repository https://github.com/leadforge-dev/leadforge
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: tests/scripts/test_validate_cli.py
URL: https://github.com/leadforge-dev/leadforge/pull/28#discussion_r3169531336
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
These CLI tests invoke the full validator in multiple subprocesses. Since `validate_dataset()` runs a scikit-learn baseline plus multi-seed leakage-trap evaluation (default 10 seeds), each invocation does a non-trivial amount of work. Consider consolidating some flag assertions into fewer subprocess runs (e.g., combine `--out-json` and `--emit-release-snippet` with the happy-path run) to keep the suite fast and reduce CI time.
## COPILOT-2
Location: tests/scripts/test_build_v5_snapshot.py:231
URL: https://github.com/leadforge-dev/leadforge/pull/28#discussion_r3169531378
Root author: copilot-pull-request-reviewer
Comment:
In `test_insufficient_positives`, the assertion `result["converted"].sum() <= 10` doesn’t actually verify the intended behavior (“all available positives should be included”). A bug that drops positives (e.g., returning 0 positives) would still satisfy `<= 10`. Consider asserting that the number of positives in the result equals the number of positives available in the input (or at least equals `min(n_pos_needed, positives_available)`).
~~~suggestion
positives_available = int(df["converted"].sum())
n_pos_needed = int(100 * 0.50)
# All available positives should be included
assert result["converted"].sum() == min(n_pos_needed, positives_available)
~~~
## COPILOT-3
Location: tests/scripts/test_build_v5_snapshot.py
URL: https://github.com/leadforge-dev/leadforge/pull/28#discussion_r3169531402
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
`test_insufficient_negatives` only checks that a warning is emitted, but doesn’t assert anything about the returned sample composition/size after the adjustment. To make this regression-proof, consider asserting the output length and that the number of negatives equals the available negatives (or the adjusted target) when negatives are insufficient.
~~~suggestion
result = subsample(df, rng, n=100, target_rate=0.10) # need 90 negatives
captured = capsys.readouterr()
assert "WARNING" in captured.err
# With 200 rows at 95% conversion, only 10 negatives are available.
# Requesting 100 rows at 10% conversion would require 90 negatives,
# so the sample must be adjusted to include all 10 negatives and the
# corresponding 10 positives, for a total of 20 rows.
assert len(result) == 20
negative_count = (result["converted"] == 0).sum()
positive_count = (result["converted"] == 1).sum()
assert negative_count == 10
assert positive_count == 10
~~~
## COPILOT-4
Location: tests/scripts/test_validate_cli.py
URL: https://github.com/leadforge-dev/leadforge/pull/28#discussion_r3169531418
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
`test_enforce_1000_flag_fails_on_small` only asserts a non-zero exit code. If the 200-row fixture ever fails another validation check (e.g., baseline AUC), this test would still pass even if `--enforce-1000` stopped affecting row-count enforcement. Consider asserting the failure reason via stdout (e.g., that the summary contains a failing `row_count` check / expected-row message), or running once without `--enforce-1000` to prove it passes without enforcement.Run metadata: |
Summary
scripts/build_v5_snapshot.pyandscripts/validate_lead_scoring_dataset.py), closing Add tests for build pipeline scripts (build_v4_snapshot.py, validate_v4_dataset.py) #23tests/scripts/test_build_v5_snapshot.pycovering all pipeline functions:derive_binary_features(),cap_expected_acv(),rename_and_select(),subsample(),inject_missingness(),boost_leakage_trap()tests/scripts/test_validate_cli.pycovering exit codes,--out-json,--emit-release-snippet,--enforce-1000, and missing argsKey edge cases tested
subsample()with insufficient positives/negatives (warns and adjusts)inject_missingness()rate bounds (<20%), source-conditional variation (SDR > inbound), columns not in spec unaffectedboost_leakage_trap()only affects converted leads, monotonically increases valuesrename_and_select()raises on missing columnsTest discovery
Scripts live in
scripts/(notleadforge/), so they're imported viaimportlib.util.spec_from_file_location()— no sys.path manipulation needed.Test plan
ruff checkandruff formatclean🤖 Generated with Claude Code
Closes #23